Implement the new tuning API for DeviceScan#7565
Implement the new tuning API for DeviceScan#7565griwes wants to merge 38 commits intoNVIDIA:mainfrom
DeviceScan#7565Conversation
This comment has been minimized.
This comment has been minimized.
bernhardmgruber
left a comment
There was a problem hiding this comment.
This looks really good already! Great work!
…feature/new-tuning-api/scan
|
Note, the warpspeed integration is still largely untested; I've added an rtxpro6000 test job to c.parallel and that will be the primary test right now. I'll lease a machine with a relevant GPU if that fails, or if there's anything that's clearly wrong to someone's eyes in review. Edit: also seems I messed up some constexprness 😅 |
This comment has been minimized.
This comment has been minimized.
|
Last remaining real failure is SASS checks in non-scan c.parallel tests on sm120; I'll pull that out of this PR, together with the enablement of the config in CI, and post it separately. |
This comment has been minimized.
This comment has been minimized.
bernhardmgruber
left a comment
There was a problem hiding this comment.
I still have to re-review the dispatch logic and the changes around the kernel, especially the refactoring to compute whether we can fit a single stage into 48KiB SMEM. Otherwise this looks pretty good already!
Ideally, we should not see any SASS changes for SM 75;80;86;90;100 for one of the benchmarks, like cub.bench.scan.sum.base. Can you please diff a SASS dump before and after the PR and confirm this? Thx!
| { | ||
| static constexpr int num_squads = 5; | ||
|
|
||
| bool valid = false; |
There was a problem hiding this comment.
Remark: we should probably introduce an algorithm enum like in DeviceTransform before all the policies go public. No changes need for now.
| return dispatch_arch(policy_selector, arch_id, [&](auto policy_getter) { | ||
| return DispatchScan<InputIteratorT, | ||
| OutputIteratorT, | ||
| ScanOpT, | ||
| InitValueT, | ||
| OffsetT, | ||
| AccumT, | ||
| EnforceInclusive, | ||
| fake_policy, | ||
| KernelSource, | ||
| KernelLauncherFactory>{ | ||
| d_temp_storage, | ||
| temp_storage_bytes, | ||
| d_in, | ||
| d_out, | ||
| num_items, | ||
| scan_op, | ||
| init_value, | ||
| stream, | ||
| -1 /* ptx_version, not used actually */, | ||
| kernel_source, | ||
| launcher_factory} | ||
| .__invoke(policy_getter, policy_selector); | ||
| }); |
There was a problem hiding this comment.
Remark: I wonder if it would have been easier to duplicate the logic from DispatchScan into the dispatch function and strip all warpspeed logic from DispatchScan. The warpspeed scan is not on a release branch yet, so it's fine if it's not reachable through DispatchScan.
|
I finished another review and I only have minor comments except for the wish to retain the static assert that one stage fits into SMEM. I am now waiting for confirmation that we don't see SASS changes. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
I have been thinking a bit about how the check whether a single stage fits into 48KiB SMEM, and I wondered whether we actually need this check in CCCL.C. The main purpose of the check is to ensure forward compatibility of compiled binaries. So if you compile for The second reason we have this check is that a user could provide us with an input type, or an accumulator type (as dictated by the scan operator), that is so huge that we go beyond 48KiB SMEM even with a conservative tuning policy, and we should just fall back to the old scan, because it's not possible to run the warpspeed scan. Now I wondered, is the set of types that CCCL.C will use open or closed? Because if we know all types that warpspeed scan will be used from CCCL.C, we can just test if it fits into SMEM in a unit test and entirely omit the entire compile time checking for CCCL.C. We would just drop the SMEM check from the |
|
I just realized we still need the runtime computation to know how much SMEM we must request :S |
This comment has been minimized.
This comment has been minimized.
|
There is SASS changes. Here's a random assortment of kernels compared: https://gist.github.com/griwes/a94e3daf0d2b58faaeebea1932e0c1b0. I believe that there's a whole bunch of codegen artifacts here + some loss/gain of uniform instructions (presumably because the changes made it both easier and harder for the compiler to reason about uniformity...). I have not spotted any significant changes in the hot paths. There's also two specific cases that seem to now be producing LMEM instructions, though as far as I can tell it's not in the hot loop either: https://gist.github.com/griwes/e0bc6107675b9a55fc3efabdc7244564. |
🥳 CI Workflow Results🟩 Finished in 2h 13m: Pass: 100%/255 | Total: 8d 19h | Max: 2h 12m | Hits: 59%/159900See results here. |
Description
Resolves #7521
Resolves #7476
Resolves #6821
Ready for review, still planning to do SASS inspection in some crucial places.
Sidenote: this exact type of task seems to fit Codex really, really well.
Checklist